
Replay Time Travelogue: Improving Nadia’s “Debugging with AI” Results Using Replay MCP

Mark Erikson
In my previous post How Replay MCP Helped Find a React Bug Faster Than Dan Abramov Did, I showed how agents were able to successfully identify and fix a difficult React internals bug using just Replay recordings of the reproduction, and how different prompts and instructions resulted in faster analysis results.
However, most of us aren’t actually working on React’s internals :) We’re using React in apps, and dealing with app-level concerns: page loading times, network requests, good UX, and data correctness. What happens when you use AI to debug those kinds of problems? Can Replay MCP make agents better at finding real-world app bugs?

Background: Nadia’s “Debugging with AI” Example

Nadia Makarevich is the author of the book Advanced React, as well as the excellent blog Developer Way, where she’s published numerous posts on React’s behavior and usage. She recently wrote a post titled “Debugging with AI: Can It Replace an Experienced Developer?”, where she tested LLM debugging abilities against an example Next.js app that had several representative bugs inserted to recreate problems she’d seen in real apps. She talked through her instructions to the LLM, how well it did solving each bug, and described where it came up with insufficient answers or just failed to solve the problem at all.
Nadia used Opus 4.5 to do realistic “blind” investigations on three different app bugs: a Zod schema error, a double loading skeleton, and a Next.js redirect error. She graded the LLM’s results in three ways:
  1. Did the LLM fix the problem?
  2. Did the LLM identify the root cause of the problem correctly?
  3. Did the LLM fix the root cause of the problem correctly?
Her results and final analysis indicated that while the LLM had some success, it struggled to understand the deeper architectural issues and either failed to identify the root cause or gave misleading answers:
| | User Page is Broken | Double skeletons problem | Weird Redirect Error |
| --- | --- | --- | --- |
| Did LLM fix the problem? | ✅ | ⚠️ Partially | ❌ |
| Did LLM identify the root cause of the problem correctly? | ✅ | ❌ | ❌ |
| Did LLM fix the root cause of the problem correctly? | ✅ | ❌ | ❌ |
As she concluded:
💡
So, what's the verdict? Can AI replace an experienced developer for debugging?
No, of course not. AI is very good at pattern recognition, and it can be very, very useful. It will do exceptionally well on standard stuff: schema validation mishaps, forgotten null checks, common runtime errors, you've seen it for yourself already
But even if the fix works, I always retrace it step by step and make sure it's the actual root cause. If the fix doesn't work, however, I almost never iterate with the LLM, other than asking it to find me something in the repo. As you've seen, half of its responses in this case will be very confident hallucinations.
When the problem requires actual understanding of why the system behaves the way it does, or how it should behave, especially from a future or user perspective, the AI fails.
And the skill here isn't knowing how to prompt better. It's knowing when to stop prompting and start thinking.
I took this as a challenge :) I wanted to see if adding Replay recordings and Replay MCP to the mix would help an agent solve these problems any better. I also wanted to see if improving prompts and providing more context could produce better results.

Agentic Time-Travel Debugging with Replay MCP

Let’s step back and recap what Replay MCP actually is.
Replay is a time-travel debugger for web apps. The Replay recording browser captures DVR-style recordings of an entire web app. Because we capture the entire browser’s behavior and inputs, we can replay the app’s execution exactly the way it ran during the original recording. Unlike session replay or prod monitoring tools, Replay lets you do time-travel debugging and inspect the app’s behavior at any point in time: see which lines of code executed, evaluate log statements for every time a line ran, view the DOM tree and React component tree at any point in time, and much more. This makes it possible to investigate and solve bugs in ways no other tool can.
Replay DevTools is our debugging UI for humans - browser devtools with time-travel built in. Replay MCP gives agents those same time-travel debugging capabilities. Agents can open a recording and use the MCP tools to investigate the same way a human would: looking at console messages, adding logpoints to evaluate expressions each time a line of code ran, getting screenshots and stack traces, and getting framework-specific insights into libraries like React, Redux, Zustand, and TanStack Query. This means agents can now do the investigation work for you automatically!

Investigation Process and Setup

For this experiment, I used my own personal agent setup: OpenCode 1.4 and Opus 4.6. I have some file search and context management plugins enabled, but otherwise no specific skills or custom behaviors.
I wanted to run this as a proper experiment: how much do prompts and investigation instructions matter, and how much does Replay MCP help? Bug #1 was pretty trivial, but for Bug #2 and Bug #3, I kicked off four parallel agent investigation sessions. Each agent session was given the same access to the Replay recordings of the bug and a local copy of the demo app source, but with varying prompt instructions:
  1. Light details, bug repro, only told to “write a bug report with a root cause and suggested fix”
  2. Same repro steps, but a detailed 8-step investigation methodology including explaining why the problem is happening
  3. Additional context on the app’s architecture and patterns
  4. Additional relevant framework skills or concepts
How would they do? What differences would we see between them in results or investigation times?

Bug #1: Zod Schema Error in a User Profile

Bug #1 was pretty trivial. Navigate to /users/1, and there’s a Zod schema error shown in both a UI error overlay and in the console messages. Nadia’s initial prompt was:
💡
When I navigate to the /users/1 route, the page shows a "Something went wrong" error. In the browser console, I see the following errors [copy-paste the entire error stack]. Server logs are clean. Can you fix it?
Her agent found this error immediately and correctly identified the fix: making the user address fields optional, and handling the type changes when rendering user.address.
My agent was given somewhat more detailed instructions on investigating and using the MCP, but this one was a gimme anyway. It called the RecordingOverview MCP tool, immediately saw the Zod error, used Screenshot to confirm the UI overlay, then used the NetworkRequest tool to check the actual API response and confirmed the address and phone fields were missing despite the schema marking them as required. It nailed the issue and cause in about 2 minutes.
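For illustration, here’s a minimal sketch of what that schema fix looks like - the exact field shapes are assumptions, since only the address and phone fields are named in the investigation:

```ts
import { z } from "zod";

// Hypothetical reconstruction of the demo app's user schema - only the
// address and phone fields are known from the bug report.
const userSchema = z.object({
  id: z.number(),
  name: z.string(),
  // The API response omits these fields, so requiring them makes parsing
  // throw. Marking them optional fixes the schema error; the rendering code
  // then needs to handle the undefined case (e.g. user.address?.city).
  address: z
    .object({
      street: z.string(),
      city: z.string(),
    })
    .optional(),
  phone: z.string().optional(),
});

type User = z.infer<typeof userSchema>;
```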
Bug #1: Agent prompt
Bug #1: Agent final analysis

Bug #2: User Page Double Loading Skeleton

After fixing the first issue, Nadia described the second bug in the app: throttle the network connection, navigate from the root page to a user profile, and see two different loading skeletons flash on screen before the final content is displayed. But loading the user profile directly only shows one skeleton.
Nadia’s prompt was:
💡
When I'm on the root page and navigate to the User page, I see two different loading skeletons appear. However, when I just refresh the User page, I see only one. Why is that and how to fix it?
She reported that her agent came up with a variety of answers, several of which involved switching from TanStack Query’s useQuery hook to useSuspenseQuery. That answer is half right, but switching to useSuspenseQuery uncovered a second problem.
The double skeletons came from mixing Next’s loading.tsx page Suspense boundary with an app-level skeleton component based on the TanStack Query isLoading flag. However, just switching to useSuspenseQuery introduces a hydration mismatch error down the line. The real fix is architectural - actually prefetching data on the server, and hydrating the query hook on the client side with the same data.
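To make the mismatch concrete, here’s a rough sketch of the two competing skeletons - the component bodies and the fetchUser helper are assumptions, while UserPage and UserPageSkeleton are names from the app:

```tsx
// app/users/[id]/loading.tsx - Next.js renders this as the route's Suspense
// fallback during navigation (skeleton #1).
export default function Loading() {
  return <RouteSkeleton />;
}
```

```tsx
// UserPage.tsx - once the route resolves, useQuery starts fetching on the
// client, so the component shows its own skeleton (skeleton #2).
import { useQuery } from "@tanstack/react-query";

export function UserPage({ userId }: { userId: string }) {
  const { data, isLoading } = useQuery({
    queryKey: ["user", userId],
    queryFn: () => fetchUser(userId),
  });
  if (isLoading) return <UserPageSkeleton />;
  return <UserDetails user={data} />;
}
```

Navigating from another route hits both fallbacks in sequence, which is why navigation shows two skeletons while a direct page load shows only one.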
Given that, I wanted to see: would an agent with Replay MCP actually find the real root cause and suggest server prefetching as the full solution?
I kicked off three agent sessions: Nadia’s baseline prompt, additional investigative method instructions, and method + some app architecture context. The results were at least a partial improvement on Nadia’s agent - here’s how they did.

Bug #2 Agent #1: Basic Prompt

Agent #1 started by inspecting the Replay recording to understand the overall behavior. It called RecordingOverview to get the initial details, UserInteractions to get a list of clicks, and made numerous Screenshot calls to get visuals on the navigation. It then used ListSources and SearchSources to identify the Users page routes and checked for Suspense handling.
A ReadSource call for UserPage.tsx returned hit counts showing it had rendered twice, once with isLoading: true. The agent then made multiple ReactRenders tool calls to check the actual render sequence and triggers. The render details showed a mixture of Suspense and loading-state based causes, confirming the mismatch. Finally, it read the local repo files on disk to confirm the available hooks.
Overall, this agent did a reasonable investigation, gathered real evidence, and wrapped up with a report. The answer was directionally correct, just not complete enough.
Total time: 5 minutes
Bug #2 Agent #1 Prompt
Bug #2 Agent #1 Solution

Bug #2 Agent #2: Investigative Process

Agent #2 was given a more detailed 8-step investigative process to follow. How much of a difference would that make?
Agent #2 started with the same initial tool calls: RecordingOverview to orient itself, UserInteractions and Screenshot to understand the usage. It also used NetworkRequest to check the timestamps of calls to /api/user and /api/current-user.
After calling ListSources to check the files in the recording, it opted to read them all from the local filesystem.
From there, it used ReactRenders to focus in on the critical Suspense-related renders and causes, but also started reading the various Next routes to see how they handled loading patterns. It identified that most routes did have an isLoading check, but only the users/[id] route had a loading.tsx file included, explaining why the bug is specific to the User page.
This agent came up with much better explanations and relevant details in its report. It identified a waterfall request happening, and found the loading skeleton mismatch. Its primary suggestion was the same incomplete useSuspenseQuery fix, but it did explicitly say “Ideal optimization would be server-side prefetching to eliminate the waterfall”.
Total time: 9 minutes
Bug #2 Agent #2 Prompt
Bug #2 Agent #2 Solution

Bug #2 Agent #3: App Architecture Summary

Agent #3 was given the same investigative steps, but also a few bullet points summarizing the known packages and structure of this app (Next.js + TanStack Query, mix of RSCs and hook loading).
Most of the investigative flow was the same: RecordingOverview, interactions, screenshots, sources, network requests. It then specifically browsed most of the query hook definitions and confirmed that Suspense-enabled query hooks existed, as well as checking more screenshots to confirm the loading skeleton visuals.
This agent came up with the fastest response. It still recommended useSuspenseQuery, but did also list server prefetching as a follow-up step. It also flagged OrderPage and ProductPage as having the same problematic pattern, and noted that the UserPageSkeleton component would become dead code and could be removed.
Total time: 4 minutes
Bug #2 Agent #3 Prompt
Bug #2 Agent #3 Solution

Bug #2: Initial Results

After running those three sessions, I was pleased to see that the agents had at least found the initial problems and identified the same useSuspenseQuery fix that Nadia’s agent had suggested. But, as she pointed out, that “fix” is incomplete and actually leads to hydration errors.
Meanwhile, it was clear that adding more codebase context and better instructions produces higher quality answers and faster results. Even just adding a few bullet points describing the rough loading patterns used in the app let Agent #3 zero in on the problem much faster.
So, what if I tried giving the agent more info on the actual technologies used in this app? More specifically, what if I gave it access to the official SKILL.md files for Next.js and TanStack Query? How would that change the results? Would the agent be able to identify the real root cause and suggest server prefetching as an actual solution?

Bug #2 Agent #4: Skill Files

I spun up Agent #4 with the same instructions as before (bug report, investigation steps, app context), but added in two extra sections: “CI context” saying this was a manual repro of a known E2E test failure and listing project dependencies, and specific pointers to the SKILL files for TanStack Query, Next.js App Router data fetching, Next.js Suspense boundaries, and Vercel’s “React best practices”.
This agent started by actively reading the skill files. From there, most of the investigation followed the same paths: Replay MCP tools to scan the recording, get screenshots, network requests, sources, and React renders. It also cross-checked all of the query hook files and the different app pages.
However, this time the result was different. While it still recommended useSuspenseQuery as the primary fix, the agent explicitly recommended “Deeper fix: eliminate the waterfall with server-side prefetching” and gave a copy-pasteable example showing use of queryClient.prefetchQuery() in the server component that exactly matched Nadia’s recommended fix! That is the correct answer!
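For reference, here’s roughly what that prefetch-and-hydrate pattern looks like with TanStack Query v5 - a sketch assuming a fetchUser helper and a ["user", id] query key, not the agent’s literal output:

```tsx
// app/users/[id]/page.tsx - server component that prefetches the user data
import {
  dehydrate,
  HydrationBoundary,
  QueryClient,
} from "@tanstack/react-query";
import { UserPage } from "./UserPage";
import { fetchUser } from "@/api/users"; // hypothetical fetch helper

export default async function Page({ params }: { params: { id: string } }) {
  const queryClient = new QueryClient();

  // Fetch on the server, so the client render starts with data in the cache.
  await queryClient.prefetchQuery({
    queryKey: ["user", params.id],
    queryFn: () => fetchUser(params.id),
  });

  // Serialize the cache and pass it down. useSuspenseQuery with the same key
  // hydrates from this state instead of refetching, eliminating both the
  // request waterfall and the hydration mismatch.
  return (
    <HydrationBoundary state={dehydrate(queryClient)}>
      <UserPage userId={params.id} />
    </HydrationBoundary>
  );
}
```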
This agent also identified several other concerns, including the Order and Product pages having the same issue, the query key not including userId, and other loading state issues.
Total time: 5 minutes
Bug #2 Agent #4 Prompt
Bug #2 Agent #4 Solution

Bug #2 Conclusions

I like Nadia’s rubric for measuring success, but I think we can expand it a bit. There’s both “did it come up with the right answers?”, and “how did it get those answers?”. (This is where I always struggled as a kid in school - I wrote the right answers, why should I have to write down my work to show how I got there? 🤣)
Looking at the four agents, here’s how we’d grade their results:
| | Run 1 (baseline) | Run 2 (+methodology) | Run 3 (+context) | Run 4 (+skills) |
| --- | --- | --- | --- | --- |
| Did it fix the problem? | ✅ | ✅ | ✅ | ✅ |
| Did it identify the root cause correctly? | ⚠️ Proximate only | ⚠️ Proximate only | ⚠️ Proximate only | ✅ |
| Did it investigate beyond the first fix? | ❌ | ✅ | ✅ | ✅ |
| Did it check the rest of the codebase? | ❌ | Partial | ✅ | ✅ |
| Did it produce the correct architectural fix? | ❌ | ❌ | ❌ | ✅ |
The “fix” rating for agents 1-3 is a bit generous given that we know that useSuspenseQuery wasn’t the complete answer. Still, it does address the immediate issue, and if that change was pushed in a PR, the next test run would have had the hydration error and eventually led to implementing the prefetching solution. So, call that a step in the right direction - it’s investigative progress that you as a developer didn’t have to do yourself.
I was impressed to see that just adding a detailed investigative process in the prompt produced better results, faster. I was very impressed to see how much of a difference adding the skills files made! I honestly didn’t know if that would help, but figured it was worth trying for comparison, and it actually did lead to a full and complete architectural fix suggestion.

Bug #3: Next.js Redirect Error

Nadia’s third bug investigation involves a redirect error. Load /users, and it correctly redirects to /users/1 so that non-admins see their own user details. However, a “Rendered more hooks than during the previous render” React error shows up.
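As a rough sketch of that setup (the admin check and helper names here are assumptions, not the app’s actual code):

```tsx
// app/users/page.tsx - server component that redirects non-admins
import { redirect } from "next/navigation";
import { getCurrentUser } from "@/api/users"; // hypothetical helper
import { UsersList } from "./UsersList";

export default async function UsersPage() {
  const user = await getCurrentUser();
  if (!user.isAdmin) {
    // Server-side redirect: non-admins only ever see their own details page
    redirect(`/users/${user.id}`);
  }
  return <UsersList />;
}
```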
Nadia’s LLM prompt was again pretty simple:
💡
When I navigate to the /users page, I see a "Something went wrong" page, and in the console I see the following error: [error code]. Investigate why the error happens and how to fix it.
She reported that her LLM completely failed in this case. It suggested a half dozen causes and “believably detailed” potential solutions, all completely wrong.
Nadia had to dig through Next’s GitHub issues, and found some semi-relevant discussions but no firm fix. Eventually she started ripping out code from her app, and after “multiple days”, found that the <SendAnalyticsData> component was calling a Server Action function inside of a useEffect. This turned out to be due to a combination of a server-side redirect and an in-progress Server Action wrapped in Suspense (per this Next.js issue). That’s incredibly obscure - no wonder her agent couldn’t figure it out!
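A minimal sketch of the problematic pattern - the component and action names come from Nadia’s post, but the body is a reconstruction:

```tsx
"use client";
import { useEffect } from "react";
import { sendAnalyticsData } from "./actions"; // a "use server" Server Action

export function SendAnalyticsData({ page }: { page: string }) {
  useEffect(() => {
    // Calling a Server Action from an effect starts a POST round-trip that
    // can interleave with an in-flight server-side redirect, triggering the
    // "Rendered more hooks" error described above.
    void sendAnalyticsData(page);
  }, [page]);

  return null;
}
```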
I created a fresh Replay recording of the bug and set up the same parallel agent sessions as before. This time, the question was: would Replay MCP enable any of my agent sessions to even come close to describing the real bug and fix? Or would this also be too obscure even with runtime data available?

Bug #3 Agent #0: No Recording

I tried doing one agent session with just access to the source to see what would happen. Interestingly, it did label <SendAnalyticsData> as the issue, but came up with a completely bogus explanation - it claimed the problem had to do with the component’s position after an early return, and said it should be moved earlier in the return logic. Clearly the source code alone wasn’t sufficient to explain what was going on here.

Bug #3 Agent #1: Basic Instructions

Agent #1 went on an incredibly deep spelunking expedition through Next and React’s internals :)
It started with the usual RecordingOverview tool, which immediately surfaced the error message. It got more details via the ReactExceptions and ConsoleMessages tools, identified the Router component as the source of the thrown error, and dug into React’s behavior with the ReactRenders tool.
After reading all of the relevant app source files, it read the source of Router via the sources tools, saw the hit counts, and jumped from there to Next’s useActionQueue internal hook. It identified all the internal hook calls and came up with 11 hooks, and flagged the line return isThenable(state) ? use(state) : state as being relevant. It then used multiple Logpoint and Evaluate calls to trace more runtime execution inside of React internal methods like updateWorkInProgressHook().
The final recommended fix was to move the redirect into middleware. This would sort of resolve the problem, but only as a band-aid - it doesn’t address the Server Action being called in a useEffect.
Still, it’s incredibly impressive how deep inside React’s internals this agent was able to investigate and how much of the actual behavior it identified in the process!
Total time: 15 minutes
Bug #3 Agent #1 Prompt
Bug #3 Agent #1 Solution

Bug #3 Agent #2: Investigative Methodology

Agent #2 was given the same 8-step investigative approach we used earlier with Bug #2.
And Agent #2 actually nailed the exact React behavior and suggested fix!
Agent #2 started with RecordingOverview, saw the Next redirect error, and took a few screenshots to confirm visuals. It then used ReactRenders to dig into the specific React commits around the redirect.
After using SearchSources and ReadSource to read the full Router component and its hit counts, it used ReactRenders again to get the specific React fiber ID for the Router instance. It identified the render cause as a setState call, then used DescribePoint to backtrack to the cause of the setState.
As with Agent #1, it traced down into useActionQueue, then read much of the app source. This time, though, it found <SendAnalyticsData>, and immediately flagged sendAnalyticsData as a Server Action. It used a mixture of source reads + hit counts, logpoints, and dependency graph checks to confirm the timing of the calls, and concluded the server response came back while the Router was still mid-redirect.
The suggested fix was indeed replacing the Server Action with a simple fetch call for the analytics request.
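Something along these lines - the /api/analytics endpoint is an assumption for illustration:

```tsx
"use client";
import { useEffect } from "react";

export function SendAnalyticsData({ page }: { page: string }) {
  useEffect(() => {
    // A plain fetch has no Server Action machinery to collide with the
    // router's in-flight redirect.
    void fetch("/api/analytics", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ page }),
    });
  }, [page]);

  return null;
}
```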
Total time: 12 minutes
Bug #3 Agent #2 Prompt
Bug #3 Agent #2 Solution

Bug #3 Agent #3: App Context and Skills

Agent #3’s instructions added some basic context about the app setup, and access to the same React / Next / TanStack skill files we used earlier.
And yet it did worse than Agent #2!
It started by reading the skills files, checked RecordingOverview and ReactRenders and some screenshots to establish a timeline, and read through the app sources. It did read <SendAnalyticsData> and looked for places that component was used.
However, it then got lost in the weeds of Router, useActionQueue, and use hook semantics, down to investigating whether use adds a new hook entry to the rendering fiber. It did at least start to question whether this bug was at the app level or internals level :)
Ultimately, its solution described the sequence of events fairly well, but its conclusion was to move the redirect into middleware, which isn’t correct.
This one’s a bit puzzling - maybe the extra skills gave it too much info on Next and React? :)
Total time: 16 minutes
Bug #3 Agent #3 Prompt
Bug #3 Agent #3 Solution

Bug #3: Initial Results

After three sessions, I had mixed results for this bug. The good news was that the Replay recording was enough for Agent #2 to actually find the real no-kidding issue! That’s a huge step forward over Nadia’s results. Still, I wanted to see if there was anything else I could do to improve the analysis here.
Let’s say this was a purpose-built debugging agent, running against a known CI failure. In that case, we’d be able to provide some additional context to describe the app and its structure. That might be useful.
Beyond that, the two agents that didn’t get the answer spent too much time diving down into the internals of Next and React. There was useful information there, but it was also a rabbit hole that trapped them. What if we gave directions to limit any internals investigation to “20%” effort, and focus 80% on the app itself? Would that curb the tendency to dive too far?

Bug #3 Agent #4: App Context and 80/20 Guidance

I kicked off Agent #4 with a key update to the previous instructions:
💡
Follow this structured investigation process. Budget your time: spend ~80% on app code, component tree, effects, and state flow. Spend at most ~20% reading framework internals. If you find yourself multiple levels deep in Next.js router source or React reconciler code, stop and ask: "Do I already know enough to identify which app component is the trigger?"
I also refined the investigative process steps to encourage it to focus on actionable results, and to lead the report with a focus on the trigger.
As usual, Agent #4 started by calling RecordingOverview, saw the redirect error, checked some screenshots, used ReactRenders to check some of the render causes, and also read ReactExceptions to get more details on the error.
However, it then focused on reading the source code for the components… and immediately identified that <SendAnalyticsData> was calling a Server Action inside of a useEffect!
From there, it dug further into the dependency graph triggers for the renders. Then it examined the network requests, found the analytics POSTs, and confirmed the timing compared to the redirect error. It did start to dive into React, but actually stopped and explicitly said “I’ve been spending too much time on React internals - let me step back and focus on the app-level mechanism”. Clearly that part of the instructions worked :)
The final solution and fix were exactly right! It pointed to the Server Action in a useEffect running while Next was mid-redirect, gave the interleaved timeline of events, and specifically recommended using fetch or navigator beacons instead of a Server Action.
Total time: 10 minutes
Bug #3 Agent #4 Prompt
Bug #3 Agent #4 Solution

Bug #3 Conclusions

This one was fascinating! Let’s see how the results looked using that updated comparison rubric:
| | Run 1 (baseline) | Run 2 (+methodology) | Run 3 (+ctx+skills) | Run 4 (best CI) |
| --- | --- | --- | --- | --- |
| Did it fix the problem? | ⚠️ Band-aid | ✅ | ⚠️ Band-aid | ✅ |
| Did it identify the root cause correctly? | ❌ | ✅ | ❌ | ✅ |
| Did it investigate beyond the first fix? | ⚠️ Wrong direction | ✅ | ⚠️ Wrong direction | ✅ |
| Did it check the rest of the codebase? | ❌ | ✅ | Partial | ✅ |
| Did it produce the correct architectural fix? | ❌ | ✅ | ❌ | ✅ |
With Bug #2, we saw a pretty consistent stairstep of improvements as we added more context and skills. Here, the agent either nailed everything or got lost in the weeds.
Agent #3 is still the one that confuses me the most. It had the same process instructions as #2, but with added app context and React skills. Somehow that led it off into the guts of React and it never actually figured out the app-level issue.
I was very pleased to see that the instructions to focus on app concerns had the desired results! In fact, it was hilarious to read the LLM actually saying “oops, I’m going too deep, back up!”

Overall Results

Nadia’s LLM runs got partial credit on Bug #2 (the double loading skeleton bug), and failed completely on Bug #3 (the broken redirect + Server Actions call).
By adding Replay recordings and improving the prompts with better investigative instructions, I saw my agents go from a partial fix of Bug #2 to nailing the exact cause and solution, and saw two of the four Bug #3 runs come up with the exact cause and solution as well!
This was very exciting! Of course I hoped that adding Replay would produce better results - that’s why I tried doing this in the first place. But I had no way of knowing for sure without actually running the experiment and seeing what happened.

Replay MCP’s Tools Were Critical

In all these explorations, the agents heavily used Replay MCP to understand both the app runtime behavior and Next + React’s internals. They consistently oriented themselves with RecordingOverview to surface app details, errors, and renders; used Screenshot to visualize what the app looked like at various points; and dug into behavior details with ReactRenders and NetworkRequest.
For the deep-dive exploration, the agents used SearchSources to find relevant files and ReadSource to view sources with hit counts. Then they dug deeper with numerous Logpoint and Evaluate calls to dynamically introspect behavior at key lines and hits.
As we saw from Nadia’s post and a couple of my example agent runs, just having the source code available wasn’t enough to solve these issues! Reading the app source was important, and it helped the agent understand the structure of the app codebase and find things like the Server Action. Both parts were necessary. But without the runtime data from the Replay recording, the agents could never have understood the full picture of the app’s behavior, and wouldn’t have been able to nail the results.

Prompting and Context Still Matter

As I talked about in the previous post on Dan’s useDeferredValue bug: I really was surprised just how much the prompt and instructions mattered here, and how much of a difference they made!
I’m sure there’s plenty of academic studies out there that talk about all these aspects in more scientific detail. I’m just one dev who tried running some experiments locally :) But this really has impressed upon me the importance of providing good context and instructions for any autonomous agent investigative work. (And, for that matter, helps explain why my own personal hands-on agentic development workflow seems to produce good results for me.)

Replay MCP: The Time-Travel Superpower Your Agents Need

As I said in the last post: I joined Replay because I saw the potential and promise of time-travel debugging and how it can make the debugging process drastically faster and better. It enables solving bugs that are otherwise impossible, and provides insights into real runtime behavior that no other monitoring or analysis tool can provide.
Replay DevTools gave humans the ability to investigate and solve bugs with time travel.
Now, Replay MCP gives your agents the same time-travel superpowers.
And, even better: we’ve updated our Replay GitHub bot to automatically trigger time-travel analysis on failed E2E tests, and provide a root cause and recommended fix as a comment on your PR! This will help your team fix flaky tests, ensure PR code quality, and improve shipping velocity. One step closer to self-healing CI/CD!
Try out Replay MCP and our E2E Test Suites integrations today, and stay tuned - we’ve got a lot more awesome time-travel debugging capabilities coming soon!